ADS Lunch Lecture Series

Packages and functions that I use

library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite
library(boot)     # Bootstrapping methods
library(mice)     # boys dataset
library(DAAG)     # CV methods

Today

Resampling Methods

  1. Modeling in general
  2. Cross Validation
  3. the Jackknife
  4. the Bootstrap

So far (and something new)

Modeling

If we take a simple model in its general notation

\[y=\alpha+\beta x+\epsilon,\]

Taking the expectation yields \[\mathbb{E}[y] = \alpha + \beta x,\]

with consequence \[y = \mathbb{E}[y] + \epsilon.\]

At the end of the day, we still have a model with estimated parameters.
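A quick simulation sketch of this (the object name `sim_fit` and the true values \(\alpha = 3\), \(\beta = 0.5\) are illustrative choices, not from the lecture) shows how such parameters are estimated:

```r
set.seed(123)                        # arbitrary seed for reproducibility
x <- runif(1000, min = 0, max = 20)  # a predictor
y <- 3 + 0.5 * x + rnorm(1000)       # y = alpha + beta*x + epsilon

sim_fit <- lm(y ~ x)                 # estimate alpha and beta from the sample
coef(sim_fit)                        # estimates close to the true 3 and 0.5
```

With 1000 cases, the estimates land close to the true parameters, but they remain estimates.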

The obvious?

George Box correctly noted that \[\text{ALL MODELS ARE WRONG, BUT SOME ARE USEFUL}\]

He also stated \[\text{HOW WRONG DO THEY NEED TO BE TO NOT BE USEFUL?}\]

These remarks are important, because the implications of being wrong are not obvious to everyone.

Correct model

In general, the only correct model is the true data generating model.

  • If you can recreate the true data generating model, then the resulting sampled measurements from that model are technically flawless and its estimates are unbiased.

Errors and residuals

Cross-validation

In general

In modeling, the observed data are used to estimate a model for those same data

Rules

  1. A useful model aims to explain as much about the data problem as possible
  2. Simpler models are usually preferred over more complex models
  3. The best performing model with the fewest parameters is considered superior

Buts, Ifs and What abouts?

  1. But a model is data-driven and depends on the data it is fitted on!
  2. If you use the data for fitting AND evaluation, isn’t it double dipping?
  3. What can then be said about the model’s performance on new data?

\[\text{A MODEL THAT FITS ONLY ON A GIVEN SET OF DATA}\\ \text{IS NOT USEFUL TO ANYTHING OUTSIDE THAT SET}\]

Some concepts

  • Test error: The average error for a method when applied to new data, not used for training the model
  • Training error: The same, but when applied to the data it has been trained on

Why are these concepts important?

  1. Most methods particularly aim to minimize the training MSE
  2. By minimizing the training MSE the focus lies on the sample, and not on the true data generating model (TDGM)
  3. A different sample from the TDGM would be nice to evaluate the model
  4. Unfortunately, such a test set (as in 3) is not always available
  5. The training error can thus severely underestimate the test error
  6. As a result, the test error rate and training error rate are often quite different

Cross-validation

In cross-validation, the data are divided into two distinct sets:

  1. a training set on which the model is fitted
  2. a validation set on which the model is evaluated

Benefits

  • The validation set allows us to estimate the test error rate
  • The performance of the model on external data can be assessed
  • The performance of methods without parameters can be assessed
  • The variance components related to the model estimation are taken into account

Some flavors of CV

  • Validation set CV: randomly divide the data into two sets

    • fewer cases to train on may influence the result -> high bias
    • the random split may influence the result -> high variance
  • Leave One Out CV (LOOCV): Fit the model on all cases, except one. Evaluate on that case. Repeat for all other cases. Average.

    • less bias than Validation Set CV
    • less variable than Validation Set CV
    • computationally intensive: \(n\) cross-validations, where \(n\) is the number of cases
  • k-Fold CV: Leave more than one out, such that only \(k<n\) cross-validations need to be done.

    • more bias than LOOCV
    • less variance than LOOCV
      • empirically supported: \(5 \leq k \leq 10\)
      • no holy \(k\)
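A minimal sketch of LOOCV in base R (the helper name `loocv_mse` is made up here, and the built-in `mtcars` data serve as a convenient stand-in):

```r
loocv_mse <- function(formula, data) {
  n      <- nrow(data)
  errors <- numeric(n)
  for (i in 1:n) {
    fit       <- lm(formula, data = data[-i, ])     # fit on all cases but i
    pred      <- predict(fit, newdata = data[i, ])  # predict the held-out case
    y_i       <- data[[all.vars(formula)[1]]][i]    # observed outcome of case i
    errors[i] <- (y_i - pred)^2
  }
  mean(errors)                                      # LOOCV estimate of test MSE
}

loocv_mse(mpg ~ wt, data = mtcars)
```

Each case serves as a test set exactly once, so the procedure is deterministic: no random split is involved.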

Data example: anscombe

the anscombe data set

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Anscombe, Francis J. (1973). Graphs in statistical analysis.
American Statistician, 27, 17–21.

Simple statistical properties

colMeans(anscombe)
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
anscombe %>% var %>% diag
##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249

linear models

anscombe %$% lm(y1 ~ x1) %>% coef
## (Intercept)          x1 
##   3.0000909   0.5000909
anscombe %$% lm(y2 ~ x2) %>% coef
## (Intercept)          x2 
##    3.000909    0.500000
anscombe %$% lm(y3 ~ x3) %>% coef
## (Intercept)          x3 
##   3.0024545   0.4997273
anscombe %$% lm(y4 ~ x4) %>% coef
## (Intercept)          x4 
##   3.0017273   0.4999091

Contemporary example

A model

Let’s use a simple anscombe data example

fit <- anscombe %$%
  lm(y1 ~ x1)

summary(fit)
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Fitted values:

y_hat <- 
  fit %>%
  fitted.values()

The residuals are then calculated as the observed minus the fitted values:

anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455
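As a check, base R computes these directly: `residuals()` returns observed minus fitted values, so it is worth verifying the sign when computing them by hand (a small self-contained sketch; `fit1` is a fresh name for the same model):

```r
fit1   <- lm(y1 ~ x1, data = anscombe)   # same model as above
y_hat1 <- fitted(fit1)

# residual = observed - fitted
all.equal(anscombe$y1 - y_hat1, residuals(fit1))
```

Reversing the subtraction flips every sign, which is easy to miss when the residuals are roughly symmetric around zero.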

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909

Predictions lie exactly on the regression line

pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

Assessing predictive accuracy

K-fold cross-validation

  • Divide the sample into \(k\) mutually exclusive folds
  • For each fold \(j\in\{1,\dots,k\}\), do:

    1. fit the model to training set \(j\) (all folds except fold \(j\))
    2. obtain predictions for test set \(j\) (the cases in fold \(j\))
    3. compute the residual variance (MSE) for test set \(j\)
  • Compare MSE in cross-validation with MSE in sample
  • Small difference suggests good predictive accuracy
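The steps above can be sketched in a few lines of base R (the helper name `kfold_mse` is made up; `DAAG::CVlm` does the same, with plots, out of the box):

```r
kfold_mse <- function(formula, data, k = 3, seed = 1) {
  set.seed(seed)                                        # arbitrary seed
  folds <- sample(rep(1:k, length.out = nrow(data)))    # random fold labels
  mse   <- numeric(k)
  for (j in 1:k) {
    train  <- data[folds != j, ]
    test   <- data[folds == j, ]
    fit    <- lm(formula, data = train)                 # 1. fit on training set j
    pred   <- predict(fit, newdata = test)              # 2. predict test set j
    y      <- test[[all.vars(formula)[1]]]
    mse[j] <- mean((y - pred)^2)                        # 3. MSE for test set j
  }
  mean(mse)                                             # average over the k folds
}

kfold_mse(y1 ~ x1, anscombe, k = 3)
```

Comparing this cross-validated MSE with the in-sample MSE gives the accuracy check described above.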

Anscombe

DAAG::CVlm(anscombe, fit, printit=F)

K-fold cross-validation anscombe data

  • residual variance sample is \(1.24^2 \approx 1.53\)
  • residual variance cross-validation is 1.93
  • regression lines in the 3 folds are similar
DAAG::CVlm(anscombe, fit, plotit=T, printit=F)

Another model

boys.fit <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)
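This formula expansion can be verified directly; a sketch on the built-in `mtcars` data, so the check runs without the `mice` package:

```r
fit_star <- lm(mpg ~ hp * wt, data = mtcars)             # shorthand notation
fit_full <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)     # written out in full

all.equal(coef(fit_star), coef(fit_full))                # identical coefficients
```

The `*` operator expands to the main effects plus their interaction, so both calls build the same design matrix.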

K-fold cross-validation boys data

  • residual variance sample is 2
  • residual variance cross-validation is 2.38
  • regression lines in the 3 folds almost identical
DAAG::CVlm(na.omit(boys), boys.fit, plotit=T, printit=F)

Parameter estimation

Running example

  • a country called sumR
  • 3 million adult humans inhabit it
  • 3 cities Rnhem, AlkmR, and LeeuwRden
  • height
  • shoe size
set.seed(54321)
 
height   <- rnorm(3e6, mean = 180, sd = 7.5)
shoesize <- 35 + 0.02*height + runif(3e6, -1, 1)
city     <- sample(c("Rnhem", "AlkmR", "LeeuwRden"), size = 3e6, 
                   prob = c(0.6, 0.3, 0.1), replace = TRUE)

sumR <- data.frame(height, shoesize, city)

Running example

par(mfrow = c(1,3))
hist(height,   breaks = 80, col = "light green", border = "transparent")
hist(shoesize, breaks = 80, col = "light blue",  border = "transparent")
table(city) %>% barplot(col = "pink")

(In a real country we will not know these things)

Let’s do statistics!

We randomly sample 100 persons and “measure” their height, shoesize and city:

samp100 <- sample(3e6, size = 100)
sdat100 <- sumR[samp100, ]
head(sdat100)
##           height shoesize      city
## 2908895 177.8716 39.28683     Rnhem
## 236907  170.2528 38.97513     Rnhem
## 1230916 195.9043 38.27168 LeeuwRden
## 1937234 177.4607 38.39331     Rnhem
## 2782743 172.3269 38.89885     AlkmR
## 2551759 169.6376 39.02313     Rnhem

Let’s use this sample to estimate quantities and infer things about our population!

Goal

We want to know the mean height of our population

The sample mean \(\bar{x}\) is an estimator for the population mean \(\mu\)

m <- mean(sdat100$height)
m
## [1] 179.7258

The realised design has mean:

sumR$height %>% mean
## [1] 179.9988

The sampling distribution

The sampling distribution: parametric

This statistic has a sampling distribution: taking a different sample generates a different mean value

Thus we have uncertainty about our estimated mean

The sampling distribution: parametric

  • The sampling distribution of the mean is a normal distribution with parameters:
    • mean \(\mu_\bar{x}\)
    • standard deviation \(\sigma_\bar{x}\)
  • We can estimate these quantities
  • Mean estimator: sample mean \[\hat{\mu}_\bar{x} = \bar{x}\]
  • Standard deviation estimator: the standard error \[\hat{\sigma}_\bar{x} = \frac{s_x}{\sqrt{n}}\]
se <- sd(sdat100$height)/sqrt(100)
se
## [1] 0.6673471

The sampling distribution: parametric

Parametric Confidence Interval

Using the standard error and the assumption of normality, we can create a confidence interval (CI):

ci <- c(m - 1.96*se, m + 1.96*se)

The sampling distribution: parametric

Interim conclusion

  • Any sample statistic has uncertainty due to sampling
  • This uncertainty is captured in the sampling distribution
  • Asymptotically (as sample size grows) we assume it is a normal distribution
  • The standard error is a measure of uncertainty
  • Using the standard error, we can create a 95% CI

The sampling distribution: algorithmic

Let’s take a step back

  • The sampling distribution of a statistic is the distribution of its estimate upon infinitely repeated sampling.
  • Since we have the full population, we can generate the sampling distribution ourselves!
  • (but instead of infinite sampling, we sample 10 000 times)
means <- numeric(10000)

for (i in 1:10000) {
  
  means[i] <- mean(sumR$height[sample(3e6, size = 100)])
  
}
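The spread of these simulated means can be checked against the theoretical value \(\sigma/\sqrt{n}\); a smaller self-contained sketch of the same idea (population size and number of draws are scaled down here):

```r
set.seed(54321)                                  # arbitrary seed
population <- rnorm(1e5, mean = 180, sd = 7.5)   # scaled-down stand-in population

sim_means <- replicate(2000, mean(sample(population, size = 100)))

sd(sim_means)    # empirical standard error of the mean
7.5 / sqrt(100)  # theoretical sigma / sqrt(n) = 0.75
```

The empirical standard deviation of the simulated means closely matches the theoretical standard error.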

The sampling distribution: algorithmic

The Jackknife

Jackknife

In the 50s and 60s, computers enabled another measure of uncertainty

The Jackknife

  • take each person out of your sample once
  • calculate the statistic on the remaining values
  • the result: 100 slightly different means
jknf <- numeric(100)

for (i in 1:100) jknf[i] <- mean(sdat100$height[-i])

mean(jknf)
## [1] 179.7258

Similar to LOOCV!

Jackknife

Jackknife

The Jackknife can be used to estimate the standard error:

\[ \begin{align} \hat{se}_{j} &= \sqrt{\frac{n-1}{n}\sum_{i=1}^n\left(\hat{\theta}_{(i)}-\overline{\hat{\theta}_{(\cdot)}}\right)^2} \end{align}\]

jkse <- sqrt(99/100*sum((jknf - mean(jknf))^2))
## Jackknife estimate: 0.667 
## Sample estimate: 0.6673471 
## 'Truth': 0.75

With this, we can once again create a sampling distribution and a CI based on normal theory.

Bootstrapping

Bootstrap

Introduced by Efron in 1979 as computers became more powerful.

  • As many times as you like (e.g. 1000 times)
  • Sample \(n\) values from your sample with replacement
  • Calculate the statistic of interest on this bootstrap sample
btsp <- numeric(1000)

for (i in 1:1000) btsp[i] <- mean(sdat100$height[sample(100, replace = TRUE)])

We sample from our sample, pulling ourselves up “by our bootstraps” toward the population
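From such a vector of bootstrap means we can read off a standard error and a percentile confidence interval; a self-contained sketch (with a simulated stand-in for the height sample, since `sdat100` is drawn at random above):

```r
set.seed(2024)                               # arbitrary seed
heights <- rnorm(100, mean = 180, sd = 7.5)  # stand-in for sdat100$height

btsp2 <- replicate(1000, mean(sample(heights, replace = TRUE)))

sd(btsp2)                          # bootstrap estimate of the standard error
quantile(btsp2, c(0.025, 0.975))   # 95% percentile confidence interval
```

The percentile interval is read straight off the bootstrap distribution, with no normality assumption.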

Bootstrap

Bootstrap approximation

Why?

No more math!

No closed-form expression for the standard error needs to be known

Relaxes the assumption of normality

For some quantities, unreasonably large sample sizes are needed before normality holds (e.g. the product of two regression coefficients in mediation analysis)

Q: What is the sampling distribution of the log mean shoe size?

log(mean(sdat100$shoesize))
## [1] 3.653499
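The bootstrap answers this without deriving anything; a self-contained sketch, with a simulated stand-in for the shoe size sample:

```r
set.seed(1)                                        # arbitrary seed
heights <- rnorm(100, mean = 180, sd = 7.5)
shoes   <- 35 + 0.02 * heights + runif(100, -1, 1) # stand-in shoe size sample

log_means <- replicate(1000, log(mean(sample(shoes, replace = TRUE))))

quantile(log_means, c(0.025, 0.975))   # 95% percentile CI for the log mean
```

A histogram or density of `log_means` shows the shape of the sampling distribution directly, whatever that shape turns out to be.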

Log mean shoe size

Proportion of humans in LeeuwRden
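A proportion is just the mean of an indicator, so the same recipe applies; a self-contained sketch with a simulated stand-in for the city sample:

```r
set.seed(2)                                                # arbitrary seed
cities <- sample(c("Rnhem", "AlkmR", "LeeuwRden"), size = 100,
                 prob = c(0.6, 0.3, 0.1), replace = TRUE)  # stand-in sample

props <- replicate(1000, mean(sample(cities, replace = TRUE) == "LeeuwRden"))

sd(props)                          # bootstrap SE of the proportion
quantile(props, c(0.025, 0.975))   # 95% percentile CI
```

With only about 10% of cases in LeeuwRden, the bootstrap distribution of the proportion can be visibly skewed, which the percentile interval accommodates automatically.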

The boot package

mystat <- function(dat, index) {
  median(dat$height[index])/mean(dat$shoesize[index])
}

bootstat <- sdat100 %>% boot(mystat, R = 1000)

# SE
sd(bootstat$t)
## [1] 0.02899687
# 95% CI
bootstat$t %>% quantile(c(0.025, 0.975))
##     2.5%    97.5% 
## 4.596886 4.698511
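The `boot` package also provides `boot.ci()`, which computes several interval types (normal, basic, percentile, BCa) from the same boot object; a sketch with a made-up statistic (`slope_stat`) on the built-in `anscombe` data:

```r
library(boot)

# made-up statistic: the slope of y1 ~ x1, recomputed on each bootstrap sample
slope_stat <- function(dat, index) {
  coef(lm(y1 ~ x1, data = dat[index, ]))[2]
}

set.seed(3)                                # arbitrary seed
b <- boot(anscombe, slope_stat, R = 1000)

boot.ci(b, type = c("perc", "basic"))      # percentile and basic intervals
```

Passing the index vector through to the statistic function is the boot package's convention: each call receives the resampled row indices for one bootstrap replicate.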

The boot package

# Distribution
data.frame(mystat = bootstat$t) %>% 
  ggplot(aes(x = mystat)) +
  geom_density(fill = "light seagreen") +
  xlim(4.5, 4.8) +
  theme_minimal()

Conclusion

  • Sampling distribution indicates uncertainty
  • Bootstrapping approximates the sampling distribution
  • Treats sample as source to sample from
  • Works when the shape of the sampling distribution is unknown
  • Package boot can be used for bootstrapping

For fun